Mask R-CNN: A Very Detailed Code Walkthrough (Part 2)


2023-11-12 13:39 | Source: compiled from the web

Previous post: Mask R-CNN: A Very Detailed Code Walkthrough (Part 1)

Contents

1 Summary of the network structure from Part 1
2 Continuing through the training code
  2.1 ROIAlign Layer
  2.2 Detection Target Layer
3 Indexing used in the code
  Example 1
  Example 2

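1 Summary of the network structure from Part 1

Part 1 covered three components: the Resnet Graph, the Region Proposal Network (RPN), and the Proposal Layer. (The MaskRCNN class ties them all together.)

The Resnet Graph is a stack of convolutions whose job is to extract features. An input image first passes through the Resnet Graph, producing [C1, C2, C3, C4, C5]; everything downstream is built on these features.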

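As the later walkthrough of the MaskRCNN class shows, the C-series features from the Resnet Graph are each processed to give the P-series features [P2, P3, P4, P5] (C1 is not used), and P6 is then obtained from P5 by max pooling. [P2, P3, P4, P5, P6] are passed as the feature maps to the Region Proposal Network (RPN), which outputs rpn_class_logits, rpn_class, and rpn_bbox.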

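Feeding these outputs into the Proposal Layer then yields the proposals.

[Figure: how the three parts covered in Part 1 are connected]

That is how the three parts from Part 1 fit together (see the figure above). Next, we continue with the structure of three more layers: the ROIAlign Layer, the Detection Target Layer, and the Feature Pyramid Network Heads.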

2 Continuing through the training code

2.1 ROIAlign Layer

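ROIAlign is the hardest part to follow. In the code it consists of two pieces:

- The function def log2_graph(x): TensorFlow, surprisingly, has no function for $\log_2 x$, so the code defines its own, which simply returns tf.log(x) / tf.log(2.0).
- The class class PyramidROIAlign(KE.Layer): like the earlier layers, it inherits from KE.Layer so that data flowing through TensorFlow ops can be handed on to Keras. This was explained in the previous post and is not repeated here.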
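The change-of-base identity behind log2_graph can be checked in plain Python. This is only an illustrative sketch: log2_change_of_base is a hypothetical name, and math.log stands in for tf.log.

```python
import math


def log2_change_of_base(x):
    """Base-2 logarithm via the identity log2(x) = ln(x) / ln(2)."""
    return math.log(x) / math.log(2.0)


if __name__ == "__main__":
    # Compare against the standard library's own log2 for a few values.
    for value in (1.0, 2.0, 8.0, 224.0):
        print(value, log2_change_of_base(value), math.log2(value))
```

The two columns should agree up to floating-point rounding, which is all that log2_graph relies on.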

The rest of this section walks through the PyramidROIAlign class.

First, the __init__ method:

    def __init__(self, pool_shape, **kwargs):
        super(PyramidROIAlign, self).__init__(**kwargs)
        self.pool_shape = tuple(pool_shape)

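As the source shows, instantiating PyramidROIAlign requires a pool_shape argument. This argument matters a great deal: it fixes the shape of the features the ROIAlign layer outputs. Typically pool_shape=(7, 7), which means that whatever the size of the input features, the output is always 7x7 (ignoring the channel dimension).

This matters because Mask R-CNN is designed to accept images of arbitrary size. For a convolution layer, the number of weights = kernel height x kernel width x input channels x number of kernels (output channels). The kernel size and the number of kernels are hyperparameters you choose, and the input channel count is fixed by the previous layer, so the input image size has no effect on the parameter count; only the size of the output feature map changes. An image of any size can be processed without error.

A dense layer is different: its parameter count depends on the size of its input. Classification ends in dense layers, so the features fed into them must have a fixed size. But the size of the images fed into Mask R-CNN is not fixed, so what can be done?

This is exactly what PyramidROIAlign is for: whatever the size of the features entering the layer, they come out at a fixed size (pool_shape, typically 7x7). The core of it is a call to tf.image.crop_and_resize. (Note that it does not shrink the features of the whole input image to 7x7; that would be a resize with no crop. Instead, PyramidROIAlign uses each object's bbox coordinates, together with the object's size relative to the whole image, to cut that object's features out of the feature map at the appropriate scale. This is easier to follow alongside the code.)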
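The parameter-count argument above (convolutions are indifferent to the input size, dense layers are not) can be checked with a little arithmetic. The sketch below is an illustration only: the function names and the layer sizes (256 channels, 1024 units) are assumptions picked for the example, and the convolution count includes the input channels.

```python
def conv2d_param_count(kernel_h, kernel_w, in_channels, out_channels, use_bias=True):
    """Weights of a 2D convolution: independent of the input height and width."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if use_bias else 0)


def dense_param_count(in_height, in_width, in_channels, units, use_bias=True):
    """Weights of a dense layer applied to a flattened feature map."""
    flattened = in_height * in_width * in_channels
    return flattened * units + (units if use_bias else 0)


if __name__ == "__main__":
    # The convolution has the same number of parameters for any input size.
    print("conv 3x3, 256 -> 256:", conv2d_param_count(3, 3, 256, 256))
    # The dense layer changes as soon as the feature map size changes.
    print("dense on 7x7x256:    ", dense_param_count(7, 7, 256, 1024))
    print("dense on 14x14x256:  ", dense_param_count(14, 14, 256, 1024))
```

Because the dense count changes with the feature map size, the features must be brought to one fixed shape first, which is the role pool_shape plays.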

Now let's see how PyramidROIAlign actually does this, in its call(self, inputs) method.

Inputs:

- boxes: [batch, num_boxes, (y1, x1, y2, x2)], in normalized coordinates
- image_meta: [batch, (meta data)], holding details about the original image, as described in Part 1
- feature_maps: the feature pyramid; each map has shape [batch, height, width, channels]

Output:

- Pooled features of fixed size: [batch, num_boxes, pool_height, pool_width, channels]

Code flow:

(1) Initialization: read boxes, image_meta, and feature_maps from inputs:

    def call(self, inputs):
        # num_boxes is the number of proposals
        # Loop over the feature levels, find the proposals that belong to
        # each level, and apply ROIAlign to them
        # Crop boxes [batch, num_boxes, (y1, x1, y2, x2)] in normalized coords
        boxes = inputs[0]
        print('boxes:', boxes)

        # Image meta
        # Holds details about the image. See compose_image_meta()
        image_meta = inputs[1]

        # Feature Maps. List of feature maps from different level of the
        # feature pyramid. Each is [batch, height, width, channels]
        feature_maps = inputs[2:]

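Where:

- boxes: shape = [batch, num_boxes, (y1, x1, y2, x2)]; the coordinates are normalized.
- image_meta: holds assorted information about the image, such as the original image size and the image id (although only image_shape is actually used here). It is produced by compose_image_meta, and its fields can be read back with parse_image_meta(meta); both were covered in Part 1.
- feature_maps: the feature maps built from the Resnet Graph features; each has shape [batch, height, width, channels].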

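How do these arguments get passed in? Like this:

    # pool_shape is a single argument; feature_maps is a list of pyramid
    # levels, so it is appended to the input list (call() reads inputs[2:])
    layer = PyramidROIAlign((7, 7))([bboxes, image_meta] + feature_maps)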

(2) Using the original image area carried in image_meta, work out which level of the feature pyramid each ROI should be pooled from.

    def call(self, inputs):
        # (1) Initialization: read boxes, image_meta, and feature_maps from inputs
        ...  # initialization code omitted

        # Assign each ROI to a level in the pyramid based on the ROI area.
        # boxes holds the ROI boxes; use them to compute each ROI's area
        y1, x1, y2, x2 = tf.split(boxes, 4, axis=2)
        h = y2 - y1  # h.shape = [batch, num_boxes, 1]
        w = x2 - x1
        # Use shape of first image. Images in a batch must have the same size.
        # Get the original image size, used to compute the image area
        image_shape = parse_image_meta_graph(image_meta)['image_shape'][0]
        # Equation 1 in the Feature Pyramid Networks paper. Account for
        # the fact that our coordinates are normalized here.
        # e.g. a 224x224 ROI (in pixels) maps to P4
        # Area of the original image
        image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)
        # Two steps to find which pyramid level each ROI is pooled from
        roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))  # h, w are normalized
        roi_level = tf.minimum(5, tf.maximum(
            2, 4 + tf.cast(tf.round(roi_level), tf.int32)))  # clamp to the range 2-5
        roi_level = tf.squeeze(roi_level, 2)  # roi_level.shape = [batch, num_boxes]

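A few words of explanation:

Why compute roi_level?

roi_level (call it k) is computed as

$$k = k_0 + \log_2\left(\frac{\sqrt{wh}}{224}\right)$$

where w and h are the width and height of the object's bounding box, so w*h is the object's area. 224 is the input size used for ImageNet pre-training. With $k_0 = 4$, for example, a 224x224 box ($\sqrt{wh} = 224$) gives k = 4, so that object's features are cropped from P4 of the feature pyramid. In the code h and w are normalized, so 224 is divided by sqrt(image_area) to put it in the same units.

If the object covers a large part of the image, it is cropped from a "deeper" feature map (one that has been through more convolutions), such as P5. If the object is small and inconspicuous, say $k_0 = 4$ and $\sqrt{wh} = 112$, then k = 3, and it is cropped from a "shallower" feature map such as P3. This helps detect objects of different sizes.

The level assigned to each ROI is stored in roi_level, with roi_level.shape = [batch, num_boxes] (after the tf.squeeze).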
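To make the level assignment concrete, here is a plain-Python sketch of the same arithmetic for a single ROI. It is an illustration under assumptions, not the layer itself: roi_pyramid_level is a hypothetical helper, math.log2 replaces log2_graph, and the 1024x1024 image size is just an example value.

```python
import math


def roi_pyramid_level(h, w, image_height, image_width, k0=4, min_level=2, max_level=5):
    """Pick the pyramid level for one ROI; h and w are normalized to (0, 1]."""
    image_area = float(image_height * image_width)
    # Dividing 224 by sqrt(image_area) expresses the 224-pixel reference
    # size in the same normalized units as h and w.
    relative_scale = math.sqrt(h * w) / (224.0 / math.sqrt(image_area))
    level = k0 + int(round(math.log2(relative_scale)))
    # Clamp to the levels that exist in the pyramid (P2 to P5).
    return min(max_level, max(min_level, level))


if __name__ == "__main__":
    size = 1024  # assumed square input image
    for side in (16, 112, 224, 448, 896):
        normalized = side / size
        print(side, "px ->", "P%d" % roi_pyramid_level(normalized, normalized, size, size))
```

A 224-pixel box lands on P4, halving the side moves it down one level, doubling moves it up one, and anything beyond the ends is clamped to P2 or P5.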

(3) Loop over feature_maps, apply tf.image.crop_and_resize at each level to get the pooled features, and collect them in a list:

    def call(self, inputs):
        # (1) Initialization: read boxes, image_meta, and feature_maps from inputs
        ...  # initialization code omitted
        # (2) Use the image area in image_meta to decide which feature map
        #     each ROI is pooled from
        ...  # code omitted

        # Loop through levels and apply ROI pooling to each. P2 to P5.
        # Uses the four fused feature maps from the different pyramid levels
        pooled = []
        # box_to_level[i, 0] is the image index of the feature,
        # box_to_level[i, 1] is its box index
        box_to_level = []
        for i, level in enumerate(range(2, 6)):  # only the four maps P2-P5 are used
            # Find the ROIs that must be pooled at this level
            # tf.where returns [coord1, coord2, ...]
            # np.where returns [[coord1.x, coord2.x, ...], [coord1.y, coord2.y, ...]]
            # Each row is (n, i): the i-th proposal of the n-th image
            # (n indexes the batch axis, i indexes the num_boxes axis)
            ix = tf.where(tf.equal(roi_level, level))
            # ix is a set of coordinates with two numbers each, because
            # roi_level.shape = [batch, num_boxes] after the squeeze.
            # level_boxes holds the coordinates of every box assigned to this level
            # box_indices holds, for every such box, the index of its image in the batch
            level_boxes = tf.gather_nd(boxes, ix)  # [number of proposals at this level, 4]

            # Box indices for crop_and_resize.
            box_indices = tf.cast(ix[:, 0], tf.int32)  # image index of each proposal
            # ix[:, 0] is taken because tf.image.crop_and_resize needs it as an argument

            # Keep track of which box is mapped to which level
            box_to_level.append(ix)

            # Stop gradient propagation to ROI proposals
            # level_boxes and box_indices are computed by the RPN, but the tensor
            # produced by applying them to the features is the input to the RCNN
            # part. Gradients must not flow between the two parts, so
            # tf.stop_gradient() cuts the gradient here.
            level_boxes = tf.stop_gradient(level_boxes)
            box_indices = tf.stop_gradient(box_indices)

            # Crop and Resize
            # From Mask R-CNN paper: "We sample four regular locations, so
            # that we can evaluate either max or average pooling. In fact,
            # interpolating only a single value at each bin center (without
            # pooling) is nearly as effective."
            #
            # Here we use the simplified approach of a single value per bin,
            # which is how it's done in tf.crop_and_resize()
            # Result: [batch * num_boxes, pool_height, pool_width, channels]
            # The API call performs bilinear interpolation
            # Arguments of tf.image.crop_and_resize:
            # - image: the feature map
            # - boxes: regions to crop, as normalized [ymin, xmin, ymax, xmax]
            # - box_ind: 1-D tensor of shape [num_boxes] linking boxes to images;
            #   box_ind[i] is the image that the i-th box refers to
            # - crop_size: the size after RoiAlign
            pooled.append(tf.image.crop_and_resize(
                feature_maps[i], level_boxes, box_indices, self.pool_shape,
                method="bilinear"))
            # Argument shapes:
            # [batch, image_height, image_width, channels]
            # [this_level_num_boxes, 4]
            # [this_level_num_boxes]
            # [pool_height, pool_width]

        # Pack pooled features into one tensor
        # The features cropped for every box, each at the level it was
        # assigned to, are stacked into one big tensor, pooled
        pooled = tf.concat(pooled, axis=0)

        # Pack box_to_level mapping into one array and add another
        # column representing the order of pooled boxes
        box_to_level = tf.concat(box_to_level, axis=0)
        box_range = tf.expand_dims(tf.range(tf.shape(box_to_level)[0]), 1)
        box_to_level = tf.concat([tf.cast(box_to_level, tf.int32), box_range],
                                 axis=1)

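A note on the key function tf.image.crop_and_resize: it first cuts out part of the image according to the [ymin, xmin, ymax, xmax] argument, then resizes that part to the size you ask for. For example:

[Figure: illustration of tf.image.crop_and_resize]

The indexing code (the part involving ix) is not easy to follow; see Example 1 in the indexing section (part 3) of this article. (You can get by without understanding it, since it does not affect the overall picture of the Mask R-CNN code, but understanding it will help when you write indexing code of your own.)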
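To make the crop-then-resize behaviour concrete, here is a pure-Python sketch of bilinear crop-and-resize on a single-channel image. It is a simplified stand-in, not TensorFlow's implementation: crop_and_resize_2d is a hypothetical name, it handles one box and one channel, and it assumes the convention that normalized coordinate 0 maps to the first row or column and 1 to the last.

```python
import math


def crop_and_resize_2d(image, box, crop_size, extrapolation_value=0.0):
    """Crop `box` from a single-channel image (list of rows), resize bilinearly.

    box is (y1, x1, y2, x2) in normalized coordinates.
    """
    height, width = len(image), len(image[0])
    y1, x1, y2, x2 = box
    crop_h, crop_w = crop_size

    def sample_position(start, end, size, index, count):
        # Spread `count` sample points evenly from `start` to `end` (inclusive).
        if count > 1:
            return start * (size - 1) + index * (end - start) * (size - 1) / (count - 1)
        return 0.5 * (start + end) * (size - 1)

    output = []
    for i in range(crop_h):
        in_y = sample_position(y1, y2, height, i, crop_h)
        row = []
        for j in range(crop_w):
            in_x = sample_position(x1, x2, width, j, crop_w)
            if not (0 <= in_y <= height - 1 and 0 <= in_x <= width - 1):
                row.append(extrapolation_value)  # sample falls outside the image
                continue
            top, left = int(math.floor(in_y)), int(math.floor(in_x))
            bottom, right = min(top + 1, height - 1), min(left + 1, width - 1)
            dy, dx = in_y - top, in_x - left
            # Interpolate along x on the two neighbouring rows, then along y.
            top_value = image[top][left] * (1 - dx) + image[top][right] * dx
            bottom_value = image[bottom][left] * (1 - dx) + image[bottom][right] * dx
            row.append(top_value * (1 - dy) + bottom_value * dy)
        output.append(row)
    return output


if __name__ == "__main__":
    tiny = [[1.0, 2.0], [3.0, 4.0]]
    for line in crop_and_resize_2d(tiny, (0.0, 0.0, 1.0, 1.0), (3, 3)):
        print(line)
```

Whatever the size of the box, the output always has the requested crop_size, which is the property PyramidROIAlign depends on.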

(4) Restore the original ordering and reshape, giving an output of shape [batch, num_boxes, pool_height, pool_width, channels]:

    def call(self, inputs):
        ...  # code for (1), (2), (3) omitted

        # At this point we have pooled, holding all the ROIAlign features, and
        # box_to_level, holding the bookkeeping for those features.
        # Because of how they were extracted, the features are no longer in the
        # original order (sorted by batch, then by box index).
        # The code below restores that order, so that each feature lines up
        # with its image and proposal.

        # Rearrange pooled features to match the order of the original boxes
        # Sort box_to_level by batch then box index
        # TF doesn't have a way to sort by two columns, so merge them and sort.
        # box_to_level[i, 0] is the image index of the feature,
        # box_to_level[i, 1] is its box index
        sorting_tensor = box_to_level[:, 0] * 100000 + box_to_level[:, 1]
        ix = tf.nn.top_k(sorting_tensor, k=tf.shape(
            box_to_level)[0]).indices[::-1]
        ix = tf.gather(box_to_level[:, 2], ix)
        pooled = tf.gather(pooled, ix)

        # Re-add the batch dimension
        shape = tf.concat([tf.shape(boxes)[:2], tf.shape(pooled)[1:]], axis=0)
        pooled = tf.reshape(pooled, shape)
        return pooled

2.2 Detection Target Layer

Inputs to the Detection Target Layer (gt stands for ground truth):

- proposals: [POST_NMS_ROIS_TRAINING, (y1, x1, y2, x2)], in normalized coordinates. If the image yields fewer proposals than this, the array is zero padded to the fixed size.
- gt_class_ids: [MAX_GT_INSTANCES], integer class IDs
- gt_boxes: [MAX_GT_INSTANCES, (y1, x1, y2, x2)], in normalized coordinates
- gt_masks: [height, width, MAX_GT_INSTANCES], of boolean type

Outputs:

- rois: [TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)], in normalized coordinates
- class_ids: [TRAIN_ROIS_PER_IMAGE], integer class IDs, zero padded to the fixed size if there are not enough
- deltas: [TRAIN_ROIS_PER_IMAGE, (dy, dx, log(dh), log(dw))]
- masks: [TRAIN_ROIS_PER_IMAGE, height, width]. The masks are cropped to their bbox and resized to the network output size.
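The (dy, dx, log(dh), log(dw)) format of deltas may be easier to read with a small example. The sketch below assumes the usual parameterization, center offsets divided by the box size plus log ratios of the sizes; the text above only names the four components, so box_refinement and its details are an assumption made for illustration.

```python
import math


def box_refinement(box, gt_box):
    """Deltas that move `box` onto `gt_box`; both are (y1, x1, y2, x2)."""
    y1, x1, y2, x2 = box
    gy1, gx1, gy2, gx2 = gt_box

    height, width = y2 - y1, x2 - x1
    center_y, center_x = y1 + 0.5 * height, x1 + 0.5 * width

    gt_height, gt_width = gy2 - gy1, gx2 - gx1
    gt_center_y, gt_center_x = gy1 + 0.5 * gt_height, gx1 + 0.5 * gt_width

    # Center shifts are measured in units of the box size,
    # size changes as log ratios.
    dy = (gt_center_y - center_y) / height
    dx = (gt_center_x - center_x) / width
    dh = math.log(gt_height / height)
    dw = math.log(gt_width / width)
    return (dy, dx, dh, dw)


if __name__ == "__main__":
    print(box_refinement((0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 0.75, 0.75)))
```

A box that already matches its ground truth gets all-zero deltas; a pure shift only changes dy and dx, and a pure rescale about the center only changes the log terms.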

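This part has three pieces:

- The overlaps_graph(boxes1, boxes2) function: computes the overlap between two sets of boxes, i.e. the IoU. The code is simple and is skipped here.
- The detection_targets_graph function: the main detection-target pipeline
- The DetectionTargetLayer class

Note that this part has no trainable parameters (there are no convolutions, so the trainable parameter count is 0). The layer's purpose is to take the proposal coordinates and the annotations and compute the roi coordinates, the proposals' coordinate offsets (deltas), and the masks.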
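Since the overlap code is skipped, here is a minimal pure-Python sketch of what an IoU matrix between two sets of boxes looks like. It is not the overlaps_graph implementation (which works on tensors); iou and overlaps_matrix are hypothetical names used only for this example.

```python
def iou(box1, box2):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    y1 = max(box1[0], box2[0])
    x1 = max(box1[1], box2[1])
    y2 = min(box1[2], box2[2])
    x2 = min(box1[3], box2[3])
    # A negative extent means the boxes do not overlap along that axis.
    intersection = max(y2 - y1, 0.0) * max(x2 - x1, 0.0)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    return intersection / union if union > 0 else 0.0


def overlaps_matrix(boxes1, boxes2):
    """Matrix of shape [len(boxes1), len(boxes2)] holding all pairwise IoUs."""
    return [[iou(b1, b2) for b2 in boxes2] for b1 in boxes1]


if __name__ == "__main__":
    proposals = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.5, 1.0, 1.5)]
    gt_boxes = [(0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)]
    for row in overlaps_matrix(proposals, gt_boxes):
        print(row)
```

Row i of the result holds the IoU of proposal i with every ground-truth box, which is the layout the later steps reduce over with a per-row maximum.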

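The code below is rather convoluted and can be numbing to read. It helps to first read the indexing section (part 3) of this article and get comfortable with tf.where, tf.gather, and tf.gather_nd.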

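Taking the delta computation as the example, the diagram below sketches how the code arrives at it; the other computations follow the same pattern.

[Figure: how the deltas are computed]

Step by step, detection_targets_graph works as follows:

(1) Remove zero padding: strip the zero entries from proposals, gt_boxes, gt_class_ids, and gt_masks (gt is short for ground truth).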

    def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
        """Generates detection targets for one image. Subsamples proposals and
        generates target class IDs, bounding box deltas, and masks for each.

        Inputs:
        proposals: [POST_NMS_ROIS_TRAINING, (y1, x1, y2, x2)] in normalized coordinates. Might
                   be zero padded if there are not enough proposals.
        gt_class_ids: [MAX_GT_INSTANCES] int class IDs
        gt_boxes: [MAX_GT_INSTANCES, (y1, x1, y2, x2)] in normalized coordinates.
        gt_masks: [height, width, MAX_GT_INSTANCES] of boolean type.

        Returns: Target ROIs and corresponding class IDs, bounding box shifts,
        and masks.
        rois: [TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)] in normalized coordinates
        class_ids: [TRAIN_ROIS_PER_IMAGE]. Integer class IDs. Zero padded.
        deltas: [TRAIN_ROIS_PER_IMAGE, (dy, dx, log(dh), log(dw))]
        masks: [TRAIN_ROIS_PER_IMAGE, height, width]. Masks cropped to bbox
               boundaries and resized to neural network output size.

        Note: Returned arrays might be zero padded if not enough target ROIs.
        """
        # Assertions
        asserts = [
            tf.Assert(tf.greater(tf.shape(proposals)[0], 0), [proposals],
                      name="roi_assertion"),
        ]
        with tf.control_dependencies(asserts):
            proposals = tf.identity(proposals)

        # Remove zero padding
        proposals, _ = trim_zeros_graph(proposals, name="trim_proposals")
        gt_boxes, non_zeros = trim_zeros_graph(gt_boxes, name="trim_gt_boxes")
        gt_class_ids = tf.boolean_mask(gt_class_ids, non_zeros,
                                       name="trim_gt_class_ids")
        gt_masks = tf.gather(gt_masks, tf.where(non_zeros)[:, 0], axis=2,
                             name="trim_gt_masks")
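The zero-padding removal in step (1) can be pictured with plain lists. This is only a sketch of the idea: trim_zeros below is a hypothetical pure-Python helper, and it assumes a padded row is one whose four coordinates are all zero.

```python
def trim_zeros(boxes):
    """Drop all-zero rows; also return the boolean mask of rows that were kept."""
    non_zeros = [sum(abs(value) for value in row) != 0 for row in boxes]
    trimmed = [row for row, keep in zip(boxes, non_zeros) if keep]
    return trimmed, non_zeros


def boolean_mask(values, mask):
    """Keep the entries of `values` whose mask entry is True."""
    return [value for value, keep in zip(values, mask) if keep]


if __name__ == "__main__":
    gt_boxes = [[0.1, 0.1, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0],
                [0.2, 0.3, 0.6, 0.9], [0.0, 0.0, 0.0, 0.0]]
    gt_class_ids = [3, 0, 7, 0]
    gt_boxes, non_zeros = trim_zeros(gt_boxes)
    gt_class_ids = boolean_mask(gt_class_ids, non_zeros)
    print(gt_boxes, gt_class_ids)
```

The mask computed from the boxes is reused for the class IDs (and, in the real code, for the masks) so that all three stay aligned after trimming.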

(2) Handle crowds (a crowd is a bounding box around several instances). tf.where gives crowd_ix, and tf.gather with it gives crowd_boxes; non_crowd_ix is then used to select gt_class_ids, gt_boxes, and gt_masks.

(The code tells crowd from non-crowd by the class ID: gt_class_id < 0 is a crowd; gt_class_id > 0 is a non-crowd.)

    def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
        # Remove zero padding
        ...  # code omitted

        # Handle COCO crowds
        # A crowd box in COCO is a bounding box around several instances. Exclude
        # them from training. A crowd box is given a negative class ID.
        # crowd_ix holds the positions where gt_class_id < 0
        crowd_ix = tf.where(gt_class_ids < 0)[:, 0]
        non_crowd_ix = tf.where(gt_class_ids > 0)[:, 0]
        crowd_boxes = tf.gather(gt_boxes, crowd_ix)
        gt_class_ids = tf.gather(gt_class_ids, non_crowd_ix)
        gt_boxes = tf.gather(gt_boxes, non_crowd_ix)
        gt_masks = tf.gather(gt_masks, non_crowd_ix, axis=2)

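tf.gather is used as tf.gather(params, indices, axis=0): it takes slices from params along axis according to the values in indices. Here indices comes from tf.where. The function shows up again in the indexing examples later, which may help in understanding it.

Usage notes and examples for tf.where: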

    tf.where(condition, x=None, y=None, name=None)
    # condition, x, and y have the same shape; condition holds bool values
    # With only condition given, it returns the indices of the True elements
    condition1 = [[True, False, False], [False, True, True]]
    # tf.where(condition1) evaluates to:
    # [[0 0]
    #  [1 1]
    #  [1 2]]

    # If x and y are given, the result takes the element of x where condition
    # is True and the element of y where it is False
    # Example:
    import tensorflow as tf

    x = [[1, 2, 3], [4, 5, 6]]
    y = [[7, 8, 9], [10, 11, 12]]
    condition3 = [[True, False, False], [False, True, True]]
    condition4 = [[True, False, False], [True, True, False]]
    with tf.Session() as sess:
        print(sess.run(tf.where(condition3, x, y)))
        print(sess.run(tf.where(condition4, x, y)))

    # Output:
    # 1. [[ 1  8  9]
    #     [10  5  6]]
    # 2. [[ 1  8  9]
    #     [ 4  5 12]]
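Putting tf.where and tf.gather together, the crowd handling of step (2) can be mimicked with plain lists. The sketch assumes, as the code comment says, that crowd boxes carry a negative class ID; where_1d, gather, and split_crowds are hypothetical pure-Python stand-ins, not TensorFlow functions.

```python
def where_1d(flags):
    """Indices of the True entries, like tf.where(cond)[:, 0] on a 1-D tensor."""
    return [index for index, flag in enumerate(flags) if flag]


def gather(params, indices):
    """Pick the entries of `params` at `indices`, like tf.gather along axis 0."""
    return [params[index] for index in indices]


def split_crowds(gt_class_ids, gt_boxes):
    """Separate crowd boxes (negative class ID) from ordinary instances."""
    crowd_ix = where_1d([class_id < 0 for class_id in gt_class_ids])
    non_crowd_ix = where_1d([class_id > 0 for class_id in gt_class_ids])
    crowd_boxes = gather(gt_boxes, crowd_ix)
    return crowd_boxes, gather(gt_class_ids, non_crowd_ix), gather(gt_boxes, non_crowd_ix)


if __name__ == "__main__":
    class_ids = [3, -1, 7, -1, 2]
    boxes = [(0.0, 0.0, 0.2, 0.2), (0.1, 0.1, 0.9, 0.9), (0.3, 0.3, 0.5, 0.5),
             (0.0, 0.5, 0.4, 1.0), (0.6, 0.6, 0.8, 0.8)]
    crowd_boxes, kept_ids, kept_boxes = split_crowds(class_ids, boxes)
    print(crowd_boxes)
    print(kept_ids, kept_boxes)
```

The same index list is applied to the class IDs and to the boxes, which keeps the i-th class ID paired with the i-th box after the split.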

(3) Compute the IoU between proposals and gt_boxes (after the previous step, gt_boxes contains only non-crowd boxes) and store it in overlaps.

    def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
        # Remove zero padding
        # Handle COCO crowds
        ...  # code omitted

        # Compute overlaps matrix [proposals, gt_boxes]
        overlaps = overlaps_graph(proposals, gt_boxes)

(4) Compute the IoU between proposals and crowd_boxes and store it in crowd_overlaps.

    def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
        # Remove zero padding
        # Handle COCO crowds
        # Compute overlaps matrix [proposals, gt_boxes]
        ...  # code omitted

        # Compute overlaps with crowd boxes [proposals, crowd_boxes]
        crowd_overlaps = overlaps_graph(proposals, crowd_boxes)
        crowd_iou_max = tf.reduce_max(crowd_overlaps, axis=1)
        no_crowd_bool = (crowd_iou_max ...

The proposals are then split into positive and negative ROIs: ① positive ROIs are those whose maximum IoU with gt_boxes is >= 0.5; ② negative ROIs are those whose maximum IoU with gt_boxes is < 0.5.

        ...
        positive_roi_bool = (roi_iou_max >= 0.5)
        positive_indices = tf.where(positive_roi_bool)[:, 0]
        # 2. Negative ROIs are those with < 0.5 with every GT box. Skip crowds.
        negative_indices = tf.where(tf.logical_and(roi_iou_max ...
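The positive/negative split can be sketched in plain Python on small IoU matrices. This is an illustration under assumptions: split_rois is a hypothetical helper, and because the crowd cutoff is not shown in the text above, crowd_iou_threshold is a parameter the caller must supply (the value 0.1 in the demo is arbitrary).

```python
def split_rois(overlaps, crowd_overlaps, crowd_iou_threshold, iou_threshold=0.5):
    """Return (positive_indices, negative_indices) for the rows of `overlaps`.

    overlaps[i][j] is the IoU of proposal i with ground-truth box j;
    crowd_overlaps[i][j] is the IoU of proposal i with crowd box j.
    """
    positive_indices, negative_indices = [], []
    for index, row in enumerate(overlaps):
        # An empty row (no boxes of that kind) counts as "no overlap at all".
        roi_iou_max = max(row) if row else float("-inf")
        crowd_row = crowd_overlaps[index]
        crowd_iou_max = max(crowd_row) if crowd_row else float("-inf")
        no_crowd = crowd_iou_max < crowd_iou_threshold

        if roi_iou_max >= iou_threshold:
            positive_indices.append(index)
        elif no_crowd:
            # Below the IoU threshold for every GT box and not on a crowd box.
            negative_indices.append(index)
    return positive_indices, negative_indices


if __name__ == "__main__":
    overlaps = [[0.7, 0.1], [0.2, 0.3], [0.1, 0.0]]
    crowd_overlaps = [[0.0], [0.0], [0.4]]
    print(split_rois(overlaps, crowd_overlaps, crowd_iou_threshold=0.1))
```

In the demo the third proposal is neither positive nor negative: its IoU with every ground-truth box is low, but it sits on a crowd box, so it is left out of training altogether.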
